TUHH 


Hamburg University of Technology 


RESEARCH PROJECT ‘THESIS 


Realtime GPGPU FFT 
Ocean Water Simulation 


INSTITUTE OF EMBEDDED SYSTEMS 


Haig Supervisor: 
F - 2 Prof. Dr. Karl-Heinz 
ynn-Jorin Flugge Fcineniats 


October 16, 2017 


STATUTORY DECLARATION 


I declare that I have authored this thesis independently, that I have not used other than 
the declared sources / resources, and that I have explicitly marked all material which has 
been quoted either literally or by content from the used sources. 


Winsen, October 16, 2017 


Fynn Fligge 


Contents 


On 


se 0 QA wD PSP 


Introduction 


Cooley-Tukey Fast Fourier Transform Algorithm 

2.1. HoursPomt DPW gate g oe ol eo eke hao dd kd ee ote oh ee a 
2:2. Hour Pomt=B ls esis, ao eu Ce ess BO eg ee Pe ee en ee 2 
2.3 From 4-Point to N-Point FFT ...........0 2.0.0... 000000002 200. 


The Statistical Ocean Waves Model 


Generating The Height Field 
4.1 Preparing the IFFT-Equation ............. 2.00.00 000 eee 
4.2 Computing the IFFT on GPU.... 2... 20.20.20... 002.0200 000000. 


4.2.1 
4.2.2 
4.2.3 
4.2.4 
4.2.5 
4.2.6 
4.2.7 
4.2.8 
4.2.9 


Outlook 


IRF? Algorithms «2.21.2 jelecte de ee Os ae es eo A oe EO es 
Complite Space «2.054052 Baka ee ee ee ee ae ae ee 
ho(k) and ho(—k) Spectrum Textures... 0.00.00. ee eee 
h(k,t) Fourier Components Texture... 0.00.00. 00000 eee 
Ping-Pong Texture... 2... ee 
Butterfly Texture... 2... ee 
Butterfly Shader ses. 2 eo ek Aen se PE Se A ee eed 
The-Ocean Height Field fist)? «32 3 eos 6 eee Ro Kee oO Ped 


Choppy: Waves. sir aot. eS as, be Bete eid es Mog Pek ee 


ho(k) and ho(—k) Compute Shader 


Fourier Components h(k,t) Compute Shader 


Butterfly Texture Compute Shader 


Butterfly Compute Shader 


Inversion and Permutation Compute Shader 


List of Figures 


Dok 
L2 
1.3 
1.4 
1.5 


ail 
a2 
2.3 
2.4 
20 
2.6 
Zi 
20 


4.1 
4.2 
4.3 
4.4 
4.5 
4.6 
4.7 
4.8 
4.9 
4.10 
4.11 
4.12 
4.13 
4.14 
4.15 


5.1 
5.2 


GPU acceleration in general software .........0..0.. 0.000008. 6 
GPUsve CPU performance comparism: 2.32.6 vidoe Age Ge ee Ge 7 
GPGPU post processing effects .. 1... 0... 8 
GPGPU sun light scattering and lens flare effect... ............ 8 
BE IG erierared Wy aber coo et as cee Ge DD eS we ee Bl Sele we eS SES 9 
‘Pwoepoiit buttertiy did gram. ¢ 2-2. bac eR wR a eS AE 1 
Two-point butterfly diagram with f[0], f[2) imput ...........002.. 12 
Two-point butterfly diagram with f[1], f[3) mput ...........0002.. 12 
Four-point butterfly diagram second stage ............0..008. 3 
Pour point: butberty- dia er ani aos ees Net ge Ge Be lee ee lk hw 13 
Four-poit butteriy diagram A. au sack ee ate ee oe ek eS 15 
Four-point butterfly diagram B .............-.2.022+200040.% Is 
Bight- point buttery aia grant: ot: be fete toe oars hone ey Shoe ee Bods 16 
Oceam. IB Overviews n% 1 27-0 eee gin) teas) yee oT etait a wakes OYE 21 
Fonizonial Vn PRS 2 abe a, shot ee ok Bh Bek Be AR ee eect 3 22 
Veriical ly BE 62. oc, eee ae eh ek RPS eh ee ee ae Ee 22 
IOS 5G COMmpule space 2: soth GS ae le ed ee ek eae Oe ee 23 
Hig Ue COG go Ae erat d cacti Wiasain 3 Sh ato ge Oh aes GA ee ga a oe ats 24 
Top wing butterfly operation... 2... 25 
Bottom wing butterfly operation... aac ee aioe Pa eR PE eS 26 
BULberily exes. x a Se Aue eecaeks Ble atS-4 Ue! bate a A tae a Bo 26 
Ocean wave height field with |k- |? factor. 2... 0 ee 29 
Ocean wave height field with |k-@|® factor... 0. 30 
Ao(k) texture with |&- [8 factor. 2. 30 
Rendered ocean surface without choppy waves ...............-. ol 
Horizontal choppy wave height field of the z-component........... 32 
Horizontal choppy wave height field of the z-component........... 32 
Rendered ocean surface with choppy waves ..............-..0-. 33 
Pr eenerated laidscape As 64.44 4-8 (ak a RO ARE SS 34 
FET generated landscape Bo a isa a aes wie AG Ale ae ae a es 35 


List of Abbreviations 


FFT 
IFFT 
DFT 
GPU 
CPU 


Fast Fourier Transform 

Inverse Fast Fourier Transform 
Discrete Fourier Transformation 
Graphical Processing Unit 


Central Processing Unit 


GPGPU General Processing on GPU 


CGI 


Computer Generated Imagery 


FLOPS floating point operations per second 


Chapter 1 


Introduction 


The Fast Fourier Transform (FFT) has a wide usage in signal processing applications, since 
the Discrete Fourier Transformation is essential for generating the spectrum or frequency 
domain of a signal resp. for computing the discrete convolution and correlation of signals. 
The IEEE magazine lists the FFT in the top ten of the most influential algorithms in the 
20th century and defines it as "the most ubiquitous algorithm in use today to analyze and 
manipulate digital or discrete data". [1] Actually the publication "An Algorithm for the 
machine calculation of complex Fourier series" by J.W. Cooley and J.W. Tukey in 1965 
introduced a new era in digital signal processing. [2] 

In contrast to the common usage of FFT’s this elaboration presents it’s application 
in a different manner. With FFT’s it is possible to simulate statistical based CGI water 
with a high degree of realism. Already in the early nineties in blockbuster movies such as 
Titanic and Waterworld, the CGI water was generated with FFT’s. However, to this times 
it was not possible to animate FFT generated water on common computer hardware in 
realtime. Only when computer architectures became more powerful in the early 2000s the 
first video games used realtime FFT generated water. But there was still a big gap in the 
degree of realism between video games and CGI movies. While Titanic and Waterworld 
used a 2D-FFT with a resolution of 2048x2048 [8, p. 2], the pioneering game Crysis by 
Crytek with it’s famous CryEngine applied a resolution of just 64x64, which lead to a big 
loss of quality compared to the high resolution CGI water in Titanic and Waterworld. The 
FFT’s in Crysis were generated on CPU on a dedicated thread [4, p. 15]. Thereby large 
resolutions became a bottleneck in realtime water rendering, since one FFT (three FF'T’s 
when Choppy-displacement is applied, more in chapter 4.2.9) has to be computed for each 
frame separately. Since a 2D-FFT is well parallelizable, a solution to this performance 
issue is shifting the computations to the GPU. In contrast to a sequential processing CPU 
with just a few cores, a GPU consists of a parallel architecture with thousands of small 
cores to handle lots of tasks concurrently. 


Already in the year 2001, once NVIDIA launched with it’s GeForce 3 series the first 
GPU’s with programmable shaders, the basis for GPU computing were laid. It still 
lasted until 2003, the starting point of the GPGPU-century, when the Stanford University 
presented it’s BrookGPU, the first GPGPU-API. Soon NVIDIA and AMD released their 
low-level-APIs CUDA and STREAM, but dedicated for their GeForce- and Radeon-Chips 
respectively. Due to dedications to GeForce (CUDA) and Radeon (STREAM) the two 
API's were not relevant in commercial usage for the market such as the video game 
industry. However, CUDA became very popular in researching area. In 2008 the Khronos 
Group released the platform-independent API OpenCL, which supports GeForce- and 
Radeon-Chips both and many more graphic processors such as the Cell-Processor used 
by Sony in it’s Playstation 8. With OpenCL it is even possible to use it in parallel with 
OpenGL or DirectX. With OpenGL 4.3 and Direct3D 11 releases, Compute-Shaders were 
introduced, which made it easier to use GPGPU directly in graphic software without 
relying on OpenCL. [5] [6] 

Meanwhile GPGPU has a wide range of usage in almost every kind of software for 
accelerating complex parallelizable computations as illustrated in figure 1.1. Figure 1.2 
shows the performance advantage of GPU’s towards CPU’s in FLOPS and GBit/s up to 
2014. While the Intel Ivy Bridge architecture reaches around 600 GFLOPS, the Geforce 
780 Ti performs with 5400 GFLOPS. In 2017 NVIDIA’s Geforce GTX 1080 Ti reaches 
a performance of 11 TFLOPS. [7] 


Application Code 


Rest of Sequential 
CPU Code 


Figure 1.1: Graphical representation how the GPU accelerates the code execution by shifting 
parallelizable computations to the GPU. [8] 
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Figure 1.2: Comparism of CPU and GPU performance measured in GFLOPS and GB/s over 
the past years. [9] 


In modern Game Engines GPGPU became not only essential for FFT generated water, 
but also for several post-processing effects! and deferred shading”. Figures 1.3 and 1.4 
showing rendered scenes with GPGPU generated post-processing effects. This elaboration 
targets the development of a realtime CGI ocean renderer with a GPGPU FFT. The 
next chapter presents the Radix-2 Cooley-Tukey FFT-algorithm. Following this, it is 
demonstrated how the Cooley-Tukey algorithm is used to animate realtime ocean water 
based on Jerry Tessendorf’s paper Simulating Ocean Water with an OpenGL GPGPU 
implementation. Figure 1.5 shows a rendered screenshot of a realtime FFT generated 
ocean surface. 

All rendered demos and screenshots presented in this elaboration were produced with 
Oreon Engine [10]. 


special effects that are applied on the rendered scene image 
2screen-space shading technique, where light and shadows are added to the rendered scene image 
after the geometry processing has finished 


Figure 1.3: Screenshot of a rendered Scene with post processing bloom, depth of field blur and 
motion blur effect. 


Figure 1.4: Screenshot of a rendered Scene with sun light scattering and lens flare effect. 


Figure 1.5: Screenshot of FFT Generated Water. 


Chapter 2 


Cooley-Tukey Fast Fourier 
Transform Algorithm 


This chapter introduces the Cooley-Tukey FFT-algorithm for computing the Discrete 
Fourier Transformation (DFT). The DFT of a discrete signal f with N samples 


f = (f[0], f[1],...,f[N — 1) (2.1) 
is defined by the N-tuple 
F = (F(0], F[/1],..., F[N — 1]), (22) 
where ae 
F[k] = So f[nle"* ,n =0,1,..,N—1. (2.3) 


Computing F of a discrete signal f with N samples in the naive way would require N 
complex multiplications and N — 1 complex additions for each element of F, which total 
time is proportional to O(n). The FFT algorithm reduces the complexity to O(n log n). 
For large N this leads to a massive performance advantage. One limitation of the FFT is, 
that the DFT must be of even order and for most efficiency N must be a power of 2. An 
option to execute FFT’s on signal data without N? samples is by adding zeros until the 
closest power of 2 is reached. This technique is known as "Zero Padding". [11, pp. 254, 
279-281, 291] 
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2.1 Four-Point DFT 


The basic concept of the FFT is to exploit the structure of the DFT by smartly rearrang- 
ing the products. Before the general procedure of the FFT algorithm will be presented 
in chapter 2.2, the computation of the DFT with N = 4 samples is demonstrated. 


Since 
etl? — 4 (2.4) 
F[k] can be simplified for N = 4 to 
PIR) = Yo = (a) ln) (2.5) 
which leads to 
F[k] = £[0] + (—2)* £[1] + (-1)* £[2] + @* £[3] . (2.6) 
Hence, F' is given by 
bh? ae f[0] + f[1]+f[2)4+ £[3] 
1 -—+ -l il, f[0] — if[1] — £[2] + if[3] 
Lop. sete = f[0] — f[1]+£[2] — £[3] }’ Cae 
1 «@ -1 -i f[0] + ¢f[1] — f[2] — if[3] 
which takes O(n*) multiplications for solving F. [11, p. 281] 
Rearranging the summands of F{k] to 
f[0]+ f[2)+f[1)+ £[3] 1 1 1 17/7 ¢8(0] 
f[0] — f[2] —7f[1]+<cf[3]} jl -¢ -1 3 f [2] 28 
flo} + £2) —e)— f13]) 7 }1 -1 1 -1/ | ep ee) 
£[0] — £[2] + éf [1] — é£[3] 1 ¢ -1 «| Lf] 


exposes a reduction of the computation costs by precomputing the subexpressions 
f[0] + £[2], f[0] — f[2], f[1] + £[3] and f[1] — f[3], since 


flo] + £2) + fl) + £3) (£[0] + £[2)) + ([1] + £[3)) 
f[0] — £[2] — ¢f[1] + ef[3] } _ | (£[0] — £[2]) — a(f[1] — £[3)) (2.9) 
f[0] + £[2] —f[1]— £3] (£[0] + £[2]) — (€[1] + £[3)) 
f[0] — £[2] + ¢f[1] — #f[3] (£[0] — £[2]) + «(€[1] — f[3)) 


The FFT exploits the possibility to save operations by rearranging the components and 
reusing intermediate results of the DFT. The next section presents the execution of the 
FFT for N=4. 
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2.2 Four-Point-FFT 


A two-point DFT consisting of two Complex multiplies and adds can be sketched with 
the following two-point butterfly diagram: 


a A=a+t+ab 


b B=a-ab 
s 


Figure 2.1: A two-point butterfly diagram consists of a single butterfly operation. A butterfly 
operation takes two complex numbers a and b as input and a fixed complex number a and 
outputs the complex values A and B. [12] 


Therefore, the subexpressions f[0| + f[2] and f[0] — f[2] can be depicted as a butterfly 
operation as follows: 


f[0] f[0] +f[2] 
‘e— - f10] - f[2) 


Figure 2.2: Two-point butterfly diagram with f[0], f[2] input. 


The butterfly operation for f[1] + f[3] and f[1] — f[3] is similar: 


f[1] f[1] +f[3] 


e— ~ fl1] - 13] 


Figure 2.3: Two-point butterfly diagram with f[1], f[3] input. 
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With divide and conquer the output of the previous two butterfly operations can be used 
for two further butterfly operations to obtain F': 


f[0]+f[2] f[0}+f[1}+f[2]+f[3] 


f[0]-f[2] f[0}-[2]- i (F{1]-f(3]) 


f[1]+f[3] f[0}+f[2]-(f[1]+f[3]) 


f[1]-f[3] f{0}-f[2]+ i (F{11-f13)) 


Figure 2.4: Four-point butterfly diagram second stage. 


Solving the DFT for N=4 with butterfly operations reduces the computation cost to 
O(nlogn), since eight complex multiplies and adds are executed (two for each sample 
f[k]). [12] 

Combining figures 2.2,2.3 and 2.4 to one graph results in the butterfly diagram for the 
four-point FFT: 


f[0] F[0] 
f[2] F[1] 
[1] F[2] 
[3] F[3] 


Figure 2.5: Four-point butterfly diagram. 
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2.3. From 4-Point to N-Point FFT 


The strategy of the FFT is that each component of the DFT is not computed separately, 
which would result in unnecessary repetitions of a (relative to N) large number of com- 
putations. Instead, the FFT does it’s computation in stages. Figures 2.2 and 2.3 are the 
butterfly operations of the first stage, while figure 2.4 is the second stage of the four-point 
FFT. Each stage takes N complex numbers from the previous stage’s output (resp. f 
in the first stage) and executes N/2 butterfly operations. The output of these butterfly 
operations are N complex numbers as input for the next stage (resp. F in the last stage). 
Since one stage involves NV/2 butterfly operations with two complex additions, two com- 
plex multiplications and one sign inversion (resp. multiplication by -1) for each butterfly 
operation, one stage can be executed in O(n) time. [12] [11, pp. 285-286] 

As seen in chapter 2.2 , FFT’s can be well illustrated with butterfly diagrams. Due to 
figure 2.5, it is obvious that the four-point FFT consists of two stages (log, N for N = 4) 
with 2 butterfly operations for each each stage (N/2 for N = 4). For a general formulation 
of the FFT the simplification e~’"/? = —i , from chapter 2.1 is not helpful. Therefore, a 
new notation is introduced, the twiddle factor [11, pp. 281-282] [13]: 
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Wy =e7'® . (2.10) 


The matrix representation of the four-point DFT by using the twiddle factor is given by 


a 
we we wi wel) (2.11) 
Wr We We We 
and with rearrangement of the components of f! 
Ww? wo w® wy F£fo] 
We Wi W2 We! | ffl] ; 
Wr Wr We We! L£f[3] 


Due to the Nth roots of unity of WS for k=0,...,N-1 the matrix representation can be 
simpliefied to 


[ Wa We We Wr] [ £19) ) 
We, W? Wi W? f [2] (2.13) 
fw Ww we Ww eal) | 
We Wr We Wye! Lf] 
since 
en =e NS ; (2.14) 
and thus 
Wr =WHeN Yme Z. (2.15) 


as defined in [11, pp. 281-282]. 
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The equivalent butterfly diagram to figure 2.5 with twiddle factors is depicted in 
figure 2.6. Alternatively the sign-inversions in the butterfly operations can be replaced 
by twiddle factors regarding to the Nth root of unity, since 


N 
kt 


—Wi=Wy ? . (2.16) 


The alternative butterfly diagram with replaced sign inversions is illustrated in figure 2.7. 


[0] F[0] 
f[2] F[1] 
[1] F[2] 
[3] F[3] 
f[0] F[0] 
f[2] F[1] 
[1] F[2] 
[3] F[3] 


Figure 2.7: Four-point butterfly diagram B. 


The FFT-implementation of this elaboration uses butterfly diagrams similar to 2.7. 
More on this will be presented in chapter 4.2. Due to the twiddle factors Wk, the mul- 
tiplicators within a butterfly-graph can be expressed in a gerneric way, which facilitates 
the construction of butterfly graphs for N > 4. But in order to build up the butterfly 
grah for N > 4 it is necessary to rearrange the DFT matrix resp. the input vector f! as 
done in the butterfly diagrams 2.5, 2.6 and 2.7. 
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Referring to the eight-point FFT buttefly diagram in figure 2.8, the rule for arranging 
the components of f' can be extracted. By inspecting the input samples it is apparent 
that the components f' in the initial stage are ordered bit-reversed. [11, pp. 289-291] 

The exponents k of the twiddle factors Wy have the form 


N 

k=n- Hatage (2.17) 
where n is the vertical index. Further the number of butterfly stages can be generically 
determined to log,n stages, since the butterfly-span is doubled after each stage to a 
maximum span of N/2 in the last stage. Thus, the total execution time of the FFT with 
log, n stages and O(n) for each stage is O(n logn). Due to the reason that the butterfly- 
span is doubled after each stage, the Cooley-Tukey FFT algorithm is a divide and conquer 
algorithm. Since the algorithm can only be applied on DFT’s with N as a power of two, 
it is also called Radix-2 FFT algorithm. [11, p. 288] 


[0] 
[4] 
f[2] 


F[0] 
F[1] 
F[2] 


[6] F[3] 
f[1] F[4] 
f[5] F[5] 
f[3] F[6] 


f[7] F[7] 


Figure 2.8: Eight-point butterfly diagram. [12] 
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Chapter 3 


The Statistical Ocean Waves Model 


This chapter introduces the statistical ocean waves model. The following presented ap- 
proach snythesizes a square surface of ocean waves from a Fast Fourier Transform on a 
statistical, emperically-based oceanographic spatial spectrum of the ocean surface. The 
generated patch can be tiled seamlessly over a large area. What the FFT produces, is 
the synthesized height displacement from octaves of sinusoidal waves at each point of 
the square at time t. The rapid FFT algorithm makes it possible to computationally 
decompose the ocean height field in trivial time. The height h(x,t) at a horizontal po- 
sition x = (x,z) at time t is defined in [3, p. 6] by the sum of sinusoids with complex 
amplitudes: 


Lia]. h(k,t) exp(ik-x) . (3.1) 


The wavevector k is a two-dimensional horizontal vector, which points in the direction of 
travel of the wave. The components k, and k, of k = (k,,k,) are defined in [8, p. 6] as 


Re = (2rnjyh) (3:2) 
and 
k, = (2am/L) (3.3) 
where n and m have bounds: 
—N/2<nm<N/2. (3.4) 


L is the horizontal dimension of the patch. The height amplitude Fourier components 
i(k, t) are generated by gaussian random numbers and a spatial spectrum. Oceanographic 
research showed with reference to statistical analysis, photographic and radar measure- 
ments of the ocean surface that the wave height amplitudes h(x, t) are approximately 
independent gaussian fluctuations with the Phillips spectrum P,(k) defined in [3, p. 6] 
by: 

_ ,eXP(=1/(KL)*) 9 0 


P,(k) = ATE 


(3.5) 


where 


Ba Ve Te. 


ies 


V is the wind speed, w is the direction of the wind and g = 9.8m/sec”, the gravitational 
constant of the earth. The factor lk - |? removes waves that are directed perpendicular 
to the wind direction vector. 

For suppressing very small waves with | << L the Phillips spectrum can be multiplied 
with the factor exp(—k?l?). 

The Fourier amplitudes are denoted as 


(ls) = a6 + 16) Pak (3.6) 


where €, and €; are independent numbers of a gaussian normal distribution with mean 0 
and standard deviation 1. [3, p. 6] 
Finally the dispersion relation 

w(k) = gk (320) 


is integrated into the equation of the Fourier amplitudes. [3, p. 5] The dispersion relation 
is the relationship between frequencies and magnitude of the water wavevectors k;, where 
g is the gravitational constant. 

With the dispersion relation, the Fourier amplitudes at time t are denoted in [3, p. 6] by 


h(k, t) = ho(k) exp(iw(k)t) + hé(—k) exp(—iw(k)t) . (3.8) 
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Chapter 4 


Generating The Height Field 


4.1 Preparing the IFFT-Equation 


In order to synthesize the ocean height field with the FFT, the DFT-equation is derived 
from (3.1). First, two variables k,1 are defined with the bounds 


O<kl<N-1. (4.1) 


Now k can be redefined with k and | as 


(4.2) 


—_ Qrk—aN 2l—a7AN 
7 Te "ob 
Since the Fourier components hk, t) are the spectral components of h(x, t), the inverse 
FFT (IFFT) must be applied on h(k,t) to obtain the ocean height field as the height 


amplitudes of the spatial domain. [3, p. 6] 
Consequently, the IFFT-equation as defined in [12, p. 272] is denoted by 


ley eee ee Inkn — 7N 2nlm — aN 
hamst) = sage YD Me exp (A) is (ame (4.3) 
0 =p N N 


With e’* = —1 the equation can be simplified to 


= 2ank Qrml 
h(n, m,t) rad vi (k,l, t)(—1)” exp (i ae (—1)™ exp (i ate . (4.4) 
R=O. T=0 N N 
By rearranging (4.4) the following 2D-IFFT equation is obtained: 
1 ie ie 2aml 2ank 
h SS (= 1)" ee (k,l, t) 3 (5 
(n,m,t) = ( re [on Ft sexo ( | | ex (a | (4.5) 
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4.2 Computing the IFFT on GPU 


This section presents the GPGPU implementation of the 2D-IFFT computation with 
OpenGL Compute Shaders. Compute Shaders were introduced in OpenGL 4.3 and are 
intended for computing data on GPU rather than drawing into the framebuffer'. The 
processing sequence of the Ocean simulation consists of various GPGPU stages and a 
final rendering stage. 2D textures with 32 bit RGBA format serve as data buffers for the 
initial, intermediate and output stages of the FFT. Overall five textures are used as data 
buffers during the FFT execution. The role of the different textures used for data storage 
are explained below. The output of the FFT resp. the ocean height field is written into 
a further texture. 

Section 4.2.1 gives an overview over the implementation of the IFFT algorithm. The 
following sections starting from 4.2.2 give a deeper insight into implementation details. 


4.2.1 IFFT Algorithm 


The IFFT implementation consists of five phases. The initial phase produces the spectrum 
textures containing the ho(k) and ho(—k) values. Phase two generates the texture holding 
the height amplitude Fourier components h(k, t). 

The third phase executes N horizontal 1D FFT’s with the Fourier components texture as 
input vectors. Each row of the input texture serves as an input vector for the 1D FFT’s. 
The output texture of the horizontal 1D FFT’s is the input for the next phase. 

In the fourth phase N vertical 1D FFT’s are executed with the output from phase three 
as input vectors. Each column of the input texture serves as an input vector for the 1D 
FFT’s. The final phase multiplies the amplitudes by (—1)™ and (—1)” and inverts the 
result with a mulitiplication by 1/N?. 


Initial Spectrum ho(k) and ho(—k) shaderpass: 
while isRunning do 

Fourier components h(k,t) shaderpass; 
pingpong := 0; 

for i=0 to i<log.N do 

horizontal butterfly shaderpass for stage 7; 
pingpong := pingpong++ mod 2; 

end 

for i=0 to i<log.N do 

vertical butterfly shaderpass for stage 2; 
pingpong := pingpong++ mod 2; 

end 

Inversion and permutation shaderpass; 


end 
Algorithm 1: 2D-IFFT Algorithm 


'The framebuffer is the data buffer for the pixels of the device display 
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Butterfly Texture 


Figure 4.1: Two textures holding the values of ho(k) and ho(—k) and a further texture (butterfly 
texture) holding informations to perform the butterfly operations are precomputed and stored 
in the GPU memory. At each time t the texture holding h(k,t) is updated. The IFFT procedure 
fetches all required data from the precomputed butterfly texture and the texture with i(k, t) 
values. The ouput is a greyscale image containing the ocean height field. 
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Figure 4.2: Illustration how the horizontal 1D-FFT’s are performed on the texture pixels. Each 
pixel row serves as input vector for the FFT stages. 
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Figure 4.3: Illustration how the vertical 1D-FFT’s are performed on the texture pixels. Each 
pixel column serves as input vector for the FFT stages. 
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4.2.2 Compute Space 


The compute space consists of a three dimensional space of work groups. Each work 
group itself has a three dimensional invocation space. The dimensions of the work group 
space with their invocation spaces must be specified by the user. Each invocation within 
the different work groups holds a uniquely defined three dimensional index vector, the 
gl_GlobalInvocationID. In the IFFT implementation the compute space is arranged in a 
way that each work group invocation computes exactly one pixel of the output texture. 
There are various possibilities to arrange the work groups. 

For example, with N = 256 a work group space with dimension (1 x 1 x 1) with invocation 
space (256 x 256 x 1) can be specified. Another option would be a work group space with 
dimension (16 x 16 x 1) and invocation space of (16 x 16 x 1). For very large N it should 
be considered that only the indiviual invocations within one work group are executed in 
parallel. Figure 4.4 illustrates the work group space with it’s invocation spaces. [14, pp. 
625-626] 


Invocation 


Local Work Group 


Global Work Group 


Figure 4.4: Illustration of a compute space consisting of 16 resp. (4 x 4 x 1) work groups with 
16 resp. (4 x 4 x 1) invocation each, which results in a 16 x 16 Compute Space with overall 256 
invocations. [14, p. 626] 
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4.2.3 ho(k) and ho(—k) Spectrum Textures 


One texture is needed for storing the initial spectrum from (3.6). The gaussian random 
numbers €, and €; are generated from two independent noise textures with the Box-Muller 
method. The real part of ho(k) is stored in the red channel while the imaginary part is 
stored in the green channel. A further texture is used to store hé(—k) with gaussian 
random numbers €, and €; generated from two further noise textures. Figure 4.5 shows 
the rendered ho(k) spectrum texture . The texture with hé(—k) looks similar. 


Figure 4.5: ho(k) texture 
Rendered image of the initial spectrum ho(k) texture with the setting: 
N = 256, L = 1000, = 0.5,w = (1,1), V =40,A=4 


4.2.4 h(k,t) Fourier Components Texture 


The time dependent h(k, t) Fourier components from equation (3.8) are stored in another 
texture. Again the real part is written in the red channel and the imaginary part is stored 
in the green channel. The time dependent Fourier components texture looks similar to 
figure 4.5. 
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4.2.5 Ping-Pong Texture 


The ping-pong texture is used for storing the output of the FFT computation stages. 
Actually two ping-pong textures are needed, but the Fourier components texture from 
the previous section 4.2.4 serves as the second ping-pong texture after the first FFT 
stage. The role of each ping-pong texture switches from input to output respectively 
after each stage. 

In the first stage the Fourier components texture serves as the input samples while the 
output samples are stored in the ping-pong texture. In the second stage the role of the 
textures are swapped. Now the ping-pong texture containing the output data of the first 
stage serves now as the input for the second stage, while the other ping-pong texture resp. 
the input texture of the first stage serves as the output buffer for the second stage. 


4.2.6 Butterfly Texture 


The butterfly texture stores all needed information to process the butterfly computations 
at the FFT stages. For the butterfly informations all four color channels of the texture 
are used. Unlike the previous textures, the size of the butterfly texture is N x logo N, 
since the horizontal dimension of the texture indicates the current FFT stage (out of 
log N stages), while the vertical dimension indicates the index of the butterfly output 
value within the FFT stages. 

The butterfly texture provides all necessary information for all FFT stages to perform the 
required butterfly operations. The red (real part) and green (imaginary part) channels 
holds the twiddle factor of the related butterfly. The input sample indices are stored in 
the blue and alpha channel. Figures 4.6 and 4.7 show the stored information related to a 
simple butterfly operation. 


Figure 4.6: In roder to perform the top wing butterfly operation to obtain a+W<., the twiddle 
factor is read from red and green channel while the input sample indices are read from the blue 
and alpha channel. The blue channel holds the index of the top input sample and the alpha 
channel holds the index of the bottom input sample. 
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Figure 4.7: In roder to perform the bottom wing butterfly operation to obtain a — Wko, the 
twiddle factor is read from red and green channel while the input sample indices are read from 
the blue and alpha channels. The blue channel holds the index of the bottom input sample and 
the alpha channel holds the index of the top input sample. 


The butterfly texture is generated by computing the information data for each output 
sample resp. each pixel dependent on the three variables N, the horizontal pixel coordinate 
(the FFT stage) and the vertical pixel coordinate (the output sample index). With pixel 
coordinates (x, y) the twiddle factor WS can be computed by just figure out the exponent 
k with 


2 BEE 
k= peel mod N , (4.6) 


since Wy is constant. 
The butterfly span is determined with 2”. If the inequality 


y mod 27") < 2 (4.7) 


evaluates to true, the current coordinate is the output of a top butterfly wing, otherwise 
bottom. With these information the second input sample index (stored in the alpha- 
channel) can be determined by adding the butterfly span to the current y-value of the 
current pixel coordinate if the inequality (4.7) evaluates to true, otherwise the butterfly 
span will be substracted. The following image shows a rendered butterfly texture for 
N = 256: 


Figure 4.8: Butterfly texture for N = 256 with resolution 256 x 8. For better visualization the 
the texture is streched onto a quad. 
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4.2.7 Butterfly Shader 


The Butterfly Compute-Shader is the core of the FFT implementation. The butterfly 
shader is used for both horizontal and vertical 1D-FFT executions. The uniform variable 
direction indicates whether horizontal or vertical butterfly operations are perfomed. Two 
further uniform variables stage and pingpong are used to determine the current stage of the 
FFT execution and which pingpong texture serves as input and output buffer. Depending 
on direction, stage and pingpong, the specific butterfly operations are executed. Listing 
4.1 shows the GLSL code of the horizontal butterfly operation. 


1 |// fetch butterfly information data 

2 |vec4 data = imageLoad(butterflyTexture , 

3 ivec2(stage, coord.x)).rgba; 
4 |// fetch first butterfly input sample 

5 |vec2 p_ = imageLoad(pingpong0 , 

6 ivec2(data.z, coord.y)).rg; 
7 |// fetch second butterfly input sample 

8 |vec2 q_ = imageLoad(pingpong0 , 

9 ivec2(data.w, coord.y)).rg; 
10 |// create complex numbers and Twiddle factor 
11 }complex p = complex(p_.x,p_.y); 

12 |complex q = complex(q_.x,q_.y); 

13 |complex w = complex(data.x,data.y); 

14 

15 |// perform butterfly operation 

16 JH = add(p,mul(w,q)); 


Listing 4.1: Horizontal Butterfly Operation 


The uniform variable stage indicates which horizontal coordinate is read from the but- 
terfly texture. The 2-dimensional vector coord specifies the global index of the compute 
space invocation resp. the gl_GlobalInvocationID. As indicated in 4.2.2 each compute 
shader invocation writes to exactly one pixel in the output image buffer. Hence, the in- 
dex of the output sample value is equivalent to the gl_GloballnvocationID. The butterfly 
information data is fetched from the butterfly texture in line two. The first parameter 
of tmageLoad specifies the texture name, the second parameter specifies the pixel coor- 
dinates. As mentioned in 4.2.6 the horizontal space of the butterfly texture represents 
the FFT stage. Hence, the horizontal coordinate of the fecthed pixel is the stage uniform 
variable. Since listing 4.1 shows a horizontal butterfly operation the vertical coordinate 
of the fetched pixel is the z-value resp. the horizontal value of the coord vector. 

The input sample values for the butterfly operation are fetched in line 5 and 8 from the 
first pinpong texture. Depending on the uniform pingpong the first or second pingpong 
texture is used to read the input values. The butterfly texture stores the index of the 
top and bottom butterfly input sample index in the blue and alpha channel. Therefore 
the x-coordinate of the fetched pixels are the z- and w-values from the fetched butterfly 
information data respectively. Accordingly the y-coordinate of the fetched pixel is the 
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vertical coordinate of the coord vector. The necessary complex numbers are initialized in 
lines 11, 12 and 18 in order to perform the butterfly operation in line 16. 

Listing 4.2 shows the similar GLSL code of the vertical butterfly operation. The only 
differences are the coordinates of the fetched pixels in line 2,5 and 8. Due to vertical 1D 
FFT’s, the y-coordinate of the fetched pixel from the butterfly texture is the y-value of 
the coord vector. Further the fetched input samples lie on the same pixel column of the 
texture. Accordingly the x-coordinates of the fetched pixel are the x-value of the coord 
vector and the y-coordinates are the z- and w-values from the fetched butterfly informa- 
tion data. 


// fetch butterfly information data 
vec4 data = imageLoad(butterflyTexture , 
ivec2(stage, coord.y)).rgba; 

// fetch first butterfly input sample 
vec2 p_ = imageLoad(pingpong0 , 

ivec2 (coord .x, -dataa ) jst: 
// fetch second butterfly input sample 
vec2 q_ = imageLoad(pingpong0 , 
9 ivec2(coord.x, data.w)).rg; 
10 |// create complex numbers and Twiddle factor 
11 |complex p = complex(p_.x,p_.y); 
12 }complex q = complex(q_.x,q_.y); 
13 |complex w = complex(data.x,data.y); 


CON DOK WNHFHR 


15 |// perform butterfly operation 
16 |H = add(p,mul(w,q)); 


Listing 4.2: Vertical Butterfly Operation 
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4.2.8 The Ocean Height Field h(x, t) 


The result of the 2D-IFFT is the greyscale ocean wave height field shown in figure 4.9. 
Brighter means higher and darker means lower height while negative height is represented 
in black, since the color values of rendered images ranges only from 0 to 1.0. By a more 
accurate consideration of the height field, it does not look really "wavy". To align the 
waves more with the wind direction, the factor |k- |? from equation (3.5) can be changed 
to |k- @|* for any arbitrary positive k. The higher the exponent k in |k - @|* will be set, 
the more the waves will be aligned with the wind direction. [3, p. 7] 

Figure 4.10 shows the greyscale ocean wave height field with the factor lk -w|>. The 
initial spectrum texture from 4.2.3 with |k - @|8 is shown in figure 4.11. 


7A. 


Figure 4.9: Ocean wave height field with |k - &|? factor. 
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Figure 4.10: Ocean wave height field with |k - &|8 factor. 


Figure 4.11: ho(k) texture with |k - o|8 factor. 
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4.2.9 Choppy Waves 


Up to this point one last thing needs to be solved to let the ocean surface appear more 
realistic. Using the wave height field from figure 4.10 for displacement of the y-component 
generates ocean waves with rounded peaks as illustrated in figure 4.12. Since naturally 
ocean waves have sharp wave peaks even with a weak wind, the ocean surface in figure 
4.12 does not look realistic. 


Figure 4.12: Rendered ocean surface without choppy waves. 


To solve this issue a horizontal displacement in both x- and z-direction needs to be 
applied besides the displacement of the y-component. Hence, two further 2D-IFFT’s are 
performed to generate the height fields for the horizontal displacement. For the horizontal 
height field generation the following Fourier amplitudes as defined in [3, p. 10] are used: 


Dt) = y ~ithlk, t) exp(tk - x) . (4.8) 


The only thing which needs to be adjusted compared to the vertical height field gen- 
eration, is to multiply the elements of the height amplitude Fourier components h(k, t) 
by the factor 


== (4.9) 


dl 


Figure 4.13 and 4.14 are the corresponding choppy waves textures to figure 4.10. Fig- 
ure 4.15 demonstrates a way more realistic appearing rendered ocean surface with height 
and choppy displacement. 


Figure 4.13: Horizontal choppy wave height field of the x-component. 


Figure 4.14: Horizontal choppy wave height field of the z-component. 
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Figure 4.15: Rendered ocean surface with choppy waves. 
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Chapter 5 
Outlook 


With the 2D-IFFT implementation not only dynamic ocean surfaces can be generated, 
but also huge terrain landscapes. By superimposing many precomputed IFFT-generated 
height fields with different scalings a natural terrain can be produced. 

Very huge and detailed terrain landscapes can be rendered in realtime by using an effi- 
cient Quadtree algorithm and the IFFT generated height fields. Screenshots of rendered 
realtime landscapes with different height field generations are shown in figures 5.1 and 
5.2. 


Figure 5.1: FFT generated landscape A 
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Figure 5.2: FFT generated landscape B 
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Appendix A 
ho(k) and ho(—k) Compute Shader 


##version 430 core 
#define M_PI 3.1415926535897932384626433832795 


layout (local_size_x = 16, local_size_y = 16) in; 


layout (binding = 0, rgba32f) writeonly uniform image2D tilde _hOk; 
layout (binding = 1, rgba32f) writeonly uniform image2D tilde _hOminusk; 


uniform sampler2D noise_r0; 
uniform sampler2D noise_i0; 
uniform sampler2D noise_rl; 
uniform sampler2D noise_il; 


uniform int N; 

uniform int L; 

uniform float A; 

uniform vec2 windDirection; 
uniform float windspeed; 


const float g = 9.81; 


// Box—Muller—Method 
vec4 gaussRND () 


{ 
vec2 texCoord = vec2(gl_GlobalInvocationID.xy)/ float (N); 
float noise00 = clamp(texture(noise_r0, texCoord).r, 0.001, 1.0); 
float noise01 = clamp(texture(noise_i0, texCoord).r, 0.001, 1.0); 
float noise02 = clamp(texture(noise_rl, texCoord).r, 0.001, 1.0); 
float noise03 = clamp(texture(noise_il, texCoord).r, 0.001, 1.0) 
float u0 = 2.0*M_PlI*xnoise00; 
float vO = sqrt(—2.0 * log(noise0l1)); 
float ul = 2.0*M_PlIx*xnoise02; 
float vl = sqrt(—2.0 * log(noise03)); 
vec4 rnd = vec4(v0 * cos(u0), vO * sin(u0), vl * cos(ul), vl * sin(ul)); 
return rnd; 

t 


void main(void) 


{ 


vec2 x = vec2(gl_GlobalInvocationID.xy) — float (N)/2.0; 
vec2 k = vec2(2.0 * M_PI * x.x/L, 2.0 * M_PI * x.y/L); 


float L_ = (windspeed * windspeed)/g; 
float mag = length(k); 

if (mag < 0.00001) mag = 0.00001; 
float magSq = mag * mag; 


//sart (Ph(k))/sqrt (2) 
float h0k = clamp(sqrt ((A/(magSq*magSq) ) 


x pow(dot(normalize(k), normalize(windDirection)), 6.0) 
* exp(—(1.0/(magSq * L_ * L_))) 
* exp(—magSq*pow(L/2000.0,2.0)))/ sqrt(2.0), 4000, 4000); 


// sqrt (Ph(—k))/sart (2) 
float hOminusk = clamp(sqrt ((A/(magSq*magSq) ) 


* pow(dot(normalize(—k), normalize(windDirection)), 6.0) 

* exp(—(1.0/(magSq * L_ * L_))) 

* exp(—magSq*pow(L/2000.0,2.0)))/ sqrt(2.0), -—4000, 4000); 
vec4 gauss_random = gaussRND (); 
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imageStore(tilde_h0Ok, ivec2(gl_GlobalInvocationID.xy), 
vec4(gauss_random.xy * hOk, 0, 1)); 


imageStore(tilde_hOminusk, ivec2(gl_GlobalInvocationID.xy), 
vec4(gauss_random.zw * hOminusk, 0, 1)); 


Listing A.1: ho(k) and ho(—k) Compute Shader 


37 


Appendix B 


Fourier Components h(k,t) Compute 
Shader 


##version 430 core 
#define M_PI 3.1415926535897932384626433832795 


layout (local_size_x = 16, local_size_y = 16) in; 
layout (binding = 0, rgba32f) writeonly uniform image2D tilde_hkt_dy; 
layout (binding = 1, rgba32f) writeonly uniform image2D tilde_hkt_dx; 
layout (binding = 2, rgba32f) writeonly uniform image2D tilde _hkt_dz; 
layout (binding = 3, rgba32f) readonly uniform image2D tilde _hOk; 
layout (binding = 4, rgba32f) readonly uniform image2D tilde _hOminusk; 
uniform int N; 
uniform int L; 
uniform float t; 
struct complex 
float real; 
float im; 
complex mul(complex c0, complex cl) 
complex c; 
c.real = cO.real * cl.real — cO.im * cl.im; 
c.im = cO.real *« cl.im + cO.im * cl.real; 
return c; 
complex add(complex c0, complex cl) 
complex c; 
c.real = cO.real + cl.real; 


c.im = cO.im + cl.im; 
return c; 


complex conj(complex c) 
complex c_conj = complex(c.real , —c.im); 
return c_conj; 


void main(void) 


{ 


vec2 x = ivec2(gl_GlobalInvocationID.xy) — float (N)/2.0; 


vec2 k vec2(2.0 * M_PI * x.x/L, 2.0 * M_PI * x.y/L); 


float magnitude = length(k); 
if (magnitude < 0.00001) magnitude = 0.00001; 


float w = sqrt(9.81 * magnitude); 


vec2 tilde _hOk_ values = imageLoad(tilde_hOk, 
ivec2(gl_GlobalInvocationID.xy)).rg; 


complex fourier_cmp = complex(tilde_hOk_values.x, tilde_hOk_values.y); 
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vec2 tilde hOminusk values = imageLoad(tilde_hOminusk , 
ivec2(gl_GlobalInvocationID.xy)).rg; 


complex fourier_cmp_conj = conj(complex(tilde_hOminusk_values.x, 
tilde_hOminusk_values.y)); 


float cos_w_t = cos(wst); 
float sin _w_t = sin(wxt); 


// euler formula 


complex exp_iwt = complex(cos_w_t, sin_w_t); 
complex exp_iwt_inv = complex(cos_w_t, —sin_w_t); 
// dy 


complex h_k_t_dy = add(mul(fourier_amp, exp_iwt), 
mul(fourier_amp_conj, exp_iwt_inv)); 


ji ax 
complex dx = complex(0.0,—k.x/magnitude ); 
complex h_k_t_dx = mul(dx, h_k_t_dy); 


// daz 
complex dy = complex(0.0,—k.y/magnitude ); 
complex h_k_t_dz = mul(dy, h_k_t_dy); 


imageStore(tilde_hkt_dy, ivec2(gl_GlobalInvocationID.xy), 
vec4(h_k_t_dy.real, h_k_t_dy.im, 0, 1)); 
imageStore(tilde_hkt_dx, ivec2(gl_GlobalInvocationID.xy), 
vec4(h_k_t_dx.real, h_k_t_dx.im, 0, 1)); 
imageStore(tilde_hkt_dz, ivec2(gl_GlobalInvocationID.xy), 
vec4(h_k_t_dz.real, h_k_t_dz.im, 0, 1)); 


Listing B.1: Fourier Components h(k,t) Compute Shader 


39 


Appendix C 


Butterfly Texture Compute Shader 


#tversion 430 core 
#tdefine M_PI 3.1415926535897932384626433832795 
layout (local_size_x = 1, local_size_y = 16) in; 
layout (binding = 0, rgba32f) writeonly uniform image2D butterflyTexture; 
layout (std430, binding = 1) buffer indices { 
int j[]; 
} bit_reversed; 
struct complex 
{ 
float real; 
float im; 
ti 
uniform int N; 
void main(void) 
{ 
vec2 x = gl_GlobalInvocationID.xy; 
float k = mod(x.y * (float(N)/ pow(2,x.x+1)), N); 
complex twiddle = complex( cos(2.0*M_Pl*k/float(N)), sin(2.0*M_PlI«xk/float(N))); 
int butterflyspan = int(pow(2, x.x)); 
int butterflywing; 
if (mod(x.y, pow(2, x.x + 1)) < pow(2, x.x)) 
butterflywing = 1; 
else butterflywing = 0; 
// first stage, bit reversed indices 
if (x.x = 0) { 
// top butterfly wing 
if (butterflywing == 1) 
imageStore(butterflyTexture , ivec2(x), 
vec4(twiddle.real, twiddle.im, 
bit_reversed.j[int(x.y)], 
bit_reversed.j[int(x.y + 1)])); 
// bot butterfly wing 
else 
imageStore(butterflyTexture , ivec2(x), 
vec4(twiddle.real , twiddle.im, 
bit_reversed.j[int(x.y — 1)], 
bit_reversed.j[int(x.y)])); 
t 
// second to log2(N) stage 
else { 
// top butterfly wing 
if (butterflywing == 1) 
imageStore(butterflyTexture , ivec2(x), 
vec4(twiddle.real, twiddle.im, 
x.y, x.y + butterflyspan)); 
// bot butterfly wing 
else 
imageStore(butterflyTexture , ivec2(x), 
vec4(twiddle.real, twiddle.im, 
x.y — butterflyspan, x.y)); 
t 
} 


Listing C.1: Butterfly Texture Compute Shader 
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Appendix D 


Butterfly Compute Shader 


##version 430 core 
#define M_PI 3.1415926535897932384626433832795 


layout (local_size_x = 16, local_size_y = 16) in; 

layout (binding = 0, rgba32f) readonly uniform image2D butterflyTexture; 
layout (binding = 1, rgba32f) uniform image2D pingpong0; 

layout (binding = 2, rgba32f) uniform image2D pingpongl; 


uniform int stage; 
uniform int pingpong; 
uniform int direction; 


struct complex 


float real; 
float im; 


complex mul(complex c0, complex cl) 


complex c; 

c.real = cO.real * cl.real — cO.im * cl.im; 
c.im = cO.real *« cl.im + cO.im * cl.real; 
return c; 


complex add(complex c0, complex cl) 


complex c; 

c.real = cO.real + cl.real; 
c.im = cO.im + cl.im; 
return c; 


void horizontalButterflies () 


{ 
complex H; 
ivec2 x = ivec2(gl_GlobalInvocationID.xy); 


if(pingpong == 0) 
vec4 data = imageLoad(butterflyTexture , ivec2(stage, x.x)).rgba; 
vec2 p_ = imageLoad(pingpong0O, ivec2(data.z, x.y)).rg; 
vec2 q_ = imageLoad(pingpong0O, ivec2(data.w, x.y)).rg; 
vec2 w_ = vec2(data.x, data.y); 


complex p = complex(p_.x,p_.y); 


complex q = complex(q_.x,q_.y); 
complex w = complex(w_.x,w_.y); 


H = add(p,mul(w,q)); 


imageStore(pingpongl, x, vec4(H.real, H.im, 0, 1)); 


else if(pingpong == 1) 
{ 
vec4 data = imageLoad(butterflyTexture , ivec2(stage, x.x)).rgba; 
vec2 p_ = imageLoad(pingpongl, ivec2(data.z, x.y)).rg; 
vec2 q_ = imageLoad(pingpongl, ivec2(data.w, x.y)).rg; 
vec2 w_ = vec2(data.x, data.y); 


complex p = complex(p_.x,p_.y); 
complex q = complex(q_.x,q_.y); 
complex w = complex(w_.x,w_.y); 
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H 


add(p, 


imageStore(pingpongO, 


} 


void verticalButterflies () 


{ 


complex H; 
ivec2 x = ivec2(gl 


if(pingpong == 0) 
{ 


vec4 data 


mul(w,q)); 


x, 


_ GlobalInvocationID.xy); 


= imageLoad(butterflyTexture , 


vec4(H.real, H.im, 


0, 1)); 


ivec2 (stage, 
data.z)).rg 


x.y)).rgba; 


3 


data.w)).rg; 


vec2 p_ = imageLoad(pingpongO, ivec2(x.x, 
vec2 q_ = imageLoad(pingpongO, ivec2(x.x, 
vec2 w_ = vec2(data.x, data.y); 


complex 
complex 
complex 


H 


imageStore(pingpongl , 


else 


{ 


if (pingpon 


gav 


vec4 data 


vec2 p_ 
vec2 q_ 
vec2 w_ 


complex 
complex 
complex 


q 
w 


complex(p_.x 


sP_-y); 


H 


add(p, 


complex(q_.x,q_.y); 
complex(w_.x,w_.y); 


add(p,mul(w,q)); 


x, 


1) 
= imageLoad(butterflyTexture , 
imageLoad(pingpongl , 
imageLoad(pingpongl , 
vec2(data.x, data.y); 


complex(p_.x,p_.y); 
complex(q_.x,q_.y); 


complex (w_.x,w_.y); 


mul(w,q)); 


vec4(H.real, H.im, 


ivec2(x.x, 
ivec2(x.x, 


0, 1)); 


ivec2 (stage, 
data.z)).rg 
data.w)).rg 


x.y)).rgba; 


3 


3 


imageStore(pingpongO, x, vec4(H.real, H.im, 0, 1)); 


t 


void main(void) 


{ 


if( direction 0) 
horizontalButterflies (); 

if( direction 1) 
verticalButterflies (); 


else 


Listing D.1: Butterfly Compute Shader 
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Appendix E 


Inversion and Permutation Compute 


Shader 


##version 430 core 
layout (local_size_x = 16, local_size_y = 16) in; 
layout (binding = 0, rgba32f) writeonly uniform image2D displacement ; 
layout (binding = 1, rgba32f) readonly uniform image2D pingpong0; 
layout (binding = 2, rgba32f) readonly uniform image2D pingpong1; 
uniform int pingpong; 
uniform int N; 
void main(void) 
{ 
ivec2 x = ivec2(gl_GlobalInvocationID.xy); 
float perms[] = { 1.0, -1.0 }; 
int index = int(mod((int(x.x + x.y)),2)); 
float perm = perms[index]; 
if(pingpong == 0) 
{ 
float h = imageLoad(pingpongO, x).r; 
imageStore(displacement , x, vec4(perm*(h/float (N«*N)) , 
permx«(h/float (N*N)) , 
permx«(h/float (N*xN)) , 
1)); 
else if(pingpong == 1) 
{ 
float h = imageLoad(pingpongl, x).r; 
imageStore(displacement , x, vec4(perm*(h/float (N*N)) , 
permx(h/float (N*N)) , 
permx«(h/float (N*N)) , 
1)); 
t 
a: 


Listing E.1: Inversion and Permutation Compute Shader 
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